Notes
Class imbalance is a common problem when working with real-world data.
It occurs when examples from one or more classes are over-represented in a dataset (e.g. spam filtering, fraud detection, disease screening).
Default: Customer default records for a credit card company
"We are interested in predicting whether an individual will default on his or her credit card payment, on the basis of annual income and monthly credit card balance."$^5$
| | default | balance | income |
|---|---|---|---|
| 1 | No | 729.526495 | 44361.62507 |
| 2 | No | 817.180407 | 12106.13470 |
| 3 | No | 1073.549164 | 31767.13895 |
| 4 | No | 529.250605 | 35704.49394 |
| 5 | No | 785.655883 | 38463.49588 |
Notes
"We have plotted annual income and monthly credit card balance for a subset of 10,000 individuals"$^5$
"It appears that individuals who defaulted tended to have higher credit card balances than those who did not."$^5$
Below is a recreation of Figure 4.1$^5$
Class Distribution (%)
No: 96.67, Yes: 3.33
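The class distribution above can be reproduced with a quick count. A minimal sketch using only the standard library (in the notebook this is presumably a pandas `value_counts(normalize=True)` call):

```python
from collections import Counter

def class_distribution(labels):
    """Return the percentage of each class label, most frequent first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.most_common()}

# Toy labels with the same imbalance shape as the Default data.
labels = ["No"] * 29 + ["Yes"]
print(class_distribution(labels))  # {'No': 96.67, 'Yes': 3.33}
```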
Notes
Notes
After splitting the data with `train_test_split`, we'll do a random search to find a model with high accuracy.
| | param_svm_clf__C | mean_test_score | std_test_score |
|---|---|---|---|
| 39 | 13.958286 | 0.973210 | 0.003349 |
| 44 | 1.129434 | 0.972963 | 0.003252 |
| 36 | 2.1321 | 0.972963 | 0.003252 |
| 24 | 17.404635 | 0.972963 | 0.003252 |
| 22 | 1.793643 | 0.972963 | 0.003252 |
Best Linear Model Accuracy: 96.56%
However, this is not much better than a completely useless model that only predicts "No".
Useless Model Accuracy: 96.11%
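The baseline accuracy is just the majority-class fraction, which is why an always-"No" model looks deceptively good. A minimal sketch:

```python
def majority_baseline_accuracy(y_true):
    """Accuracy of a 'useless' model that always predicts the most frequent class."""
    most_common = max(set(y_true), key=y_true.count)
    return y_true.count(most_common) / len(y_true)

# With 96.11% "No" labels, always predicting "No" is 96.11% accurate.
y = ["No"] * 9611 + ["Yes"] * 389
print(f"{majority_baseline_accuracy(y):.2%}")  # 96.11%
```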
Notes
This binary classifier can make two types of errors$^5$: it can incorrectly assign an individual who defaults to the no-default category, or it can incorrectly assign an individual who does not default to the default category.
While the overall error rate is low, the error rate among individuals who defaulted is very high.
Notes
Error and Accuracy$^1$
These give overall performance information: the proportion of incorrect (or correct) predictions relative to the total number of predictions, across both the positive and negative labels.
$$ \begin{align} ERR &= \frac{FP+FN}{FP+FN+TP+TN} \\ \\ ACC &= 1-ERR \end{align} $$

Accuracy (ACC): 0.966
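As a check, ERR and ACC can be computed directly from confusion-matrix counts. The counts below (TP=7, FP=3, FN=28, TN=862) are inferred from the classification report further down, not taken from the source:

```python
def err_acc(tp, tn, fp, fn):
    """Error rate and accuracy from confusion-matrix counts."""
    err = (fp + fn) / (tp + tn + fp + fn)
    return err, 1 - err

err, acc = err_acc(tp=7, tn=862, fp=3, fn=28)
print(f"ERR = {err:.3f}, ACC = {acc:.3f}")  # ACC = 0.966
```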
Precision (PRE)$^1$
Measures how many of the samples predicted as positive are actually positive:
$$ PRE = \frac{TP}{TP+FP} $$
Precision (PRE): 0.700
Recall (or True Positive Rate)$^1$
Calculates how many of the actual positives our model correctly (or incorrectly) labelled:
$$ REC = \frac{TP}{TP+FN} $$
This is useful when the fraction of correctly classified or misclassified samples in the positive class is of interest.
Recall (REC): 0.200
F1-score$^1$
The harmonic mean of precision and recall:
$$ F1 = 2\frac{PRE \times REC}{PRE + REC} $$
F1-Score (F1): 0.311
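The three scores above follow from the same inferred confusion-matrix counts (TP=7, FP=3, FN=28; these are reconstructed from the report below, not from the source). A sketch:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 (their harmonic mean) from counts."""
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * pre * rec / (pre + rec)
    return pre, rec, f1

pre, rec, f1 = precision_recall_f1(tp=7, fp=3, fn=28)
print(f"PRE = {pre:.3f}, REC = {rec:.3f}, F1 = {f1:.3f}")  # 0.700, 0.200, 0.311
```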
We can use a classification report, which gives more information such as the macro avg and weighted avg.
Macro Average
Weighted Average
Notes
The scores above refer to the "Yes" class. This is because in binary classification problems, the default positive label is the target (class 1). You can change this if you are more interested in another class's performance or in the average metrics.

| | No | Yes | accuracy | macro avg | weighted avg |
|---|---|---|---|---|---|
| precision | 0.97 | 0.70 | 0.97 | 0.83 | 0.96 |
| recall | 1.00 | 0.20 | 0.97 | 0.60 | 0.97 |
| f1-score | 0.98 | 0.31 | 0.97 | 0.65 | 0.96 |
| support | 865.00 | 35.00 | 0.97 | 900.00 | 900.00 |
Notes
Training Error > Test Error$^5$
Notes
Optimising for Accuracy
During hyperparameter cross-validation we are choosing the model with the best overall accuracy.
This gives us a model with the smallest possible total number of misclassified observations, irrespective of which class the errors come from$^5$.
Because ML algorithms typically optimize a reward or cost function computed as a sum over the training examples, the decision rule is likely to be biased toward the majority class$^9$.
Notes
There are a number of methods available to address imbalances in a dataset, such as:
Some of the folds may not contain the same class proportions, so the validation error we get from the models may be a poor estimate of performance.
KFold
| | Fold 0 | Fold 1 | Fold 2 | Fold 3 | Fold 4 |
|---|---|---|---|---|---|
| 0 | 1565 | 1572 | 1577 | 1570 | 1560 |
| 1 | 55 | 48 | 43 | 50 | 60 |
StratifiedKFold
| | Fold 0 | Fold 1 | Fold 2 | Fold 3 | Fold 4 |
|---|---|---|---|---|---|
| 0 | 1568 | 1569 | 1569 | 1569 | 1569 |
| 1 | 52 | 51 | 51 | 51 | 51 |
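`StratifiedKFold` keeps the class proportions in every fold. A simplified, standard-library sketch of the idea (no shuffling; scikit-learn's real implementation differs in detail):

```python
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits=5):
    """Assign sample indices to folds round-robin within each class,
    so every fold keeps roughly the original class balance."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

labels = [0] * 7844 + [1] * 256  # roughly the training-set imbalance above
for i, fold in enumerate(stratified_folds(labels)):
    print(f"Fold {i}:", Counter(labels[idx] for idx in fold))
```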
During model fitting we can assign a larger penalty to wrong predictions on the minority class.
The heuristic used for `class_weight="balanced"` in Scikit-Learn (0.23.1) is:
$$ w_c = \frac{n}{N_c \sum_{i=1}^{n} I(y_i \in S_c)} $$
where $n$ is the number of samples, $N_c$ the number of classes, $I$ an indicator function, and $S_c$ the set of elements of class $c$.
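A standard-library sketch of this heuristic: each class weight is the total sample count divided by the number of classes times that class's count, so the minority class gets the larger penalty.

```python
from collections import Counter

def balanced_class_weights(y):
    """The 'balanced' heuristic: w_c = n / (N_c * count_c)."""
    n, counts = len(y), Counter(y)
    n_classes = len(counts)
    return {c: n / (n_classes * count) for c, count in counts.items()}

y = ["No"] * 90 + ["Yes"] * 10
print(balanced_class_weights(y))  # minority "Yes" gets the larger weight
```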
| | param_svm_clf__class_weight | param_svm_clf__C | mean_test_accuracy | std_test_accuracy |
|---|---|---|---|---|
| 12 | None | 12.457135 | 0.972963 | 0.002512 |
| 19 | None | 20.185646 | 0.972963 | 0.002512 |
| 16 | None | 17.404635 | 0.972963 | 0.002512 |
| 0 | None | 5.620905 | 0.972963 | 0.002388 |
| 52 | None | 4.498096 | 0.972963 | 0.002388 |
Extra
So far in these notes we have been using the standard classification report from Scikit-Learn; however, we may wish instead to use one more suited to imbalanced data.
TODO
Notes
| | No | Yes | accuracy | macro avg | weighted avg |
|---|---|---|---|---|---|
| precision | 0.97 | 0.73 | 0.97 | 0.85 | 0.96 |
| recall | 1.00 | 0.23 | 0.97 | 0.61 | 0.97 |
| f1-score | 0.98 | 0.35 | 0.97 | 0.67 | 0.96 |
| support | 865.00 | 35.00 | 0.97 | 900.00 | 900.00 |
| | No | Yes | avg / total |
|---|---|---|---|
| precision | 0.97 | 0.73 | 0.96 |
| recall | 1.00 | 0.23 | 0.97 |
| specificity | 0.23 | 1.00 | 0.26 |
| f1-score | 0.98 | 0.35 | 0.96 |
| geo | 0.48 | 0.48 | 0.48 |
| iba | 0.25 | 0.21 | 0.24 |
| support | 865.00 | 35.00 | 900.00 |
Changing the metric for what is defined as the "best model" can help us prioritise models that make particular errors.
For example, a credit card company might particularly wish to avoid incorrectly classifying an individual who will default, whereas incorrectly classifying an individual who will not default, though still to be avoided, is less problematic.
In this case, recall would therefore be a useful metric to use.
Notes
Balanced models are indeed better if we want a good average recall or F1.

| | param_svm_clf__class_weight | param_svm_clf__C | mean_test_recall | std_test_recall |
|---|---|---|---|---|
| 9 | balanced | 0.031589 | 0.882730 | 0.021834 |
| 10 | balanced | 0.397242 | 0.878808 | 0.029134 |
| 37 | balanced | 2.889491 | 0.878808 | 0.029134 |
| 1 | balanced | 0.397409 | 0.878808 | 0.029134 |
| 14 | balanced | 0.216116 | 0.878808 | 0.029134 |
| | param_svm_clf__class_weight | param_svm_clf__C | mean_test_f1 | std_test_f1 |
|---|---|---|---|---|
| 32 | balanced | 61.146821 | 0.505370 | 0.035163 |
| 41 | balanced | 55.477953 | 0.481678 | 0.035227 |
| 23 | balanced | 76.945463 | 0.478551 | 0.054680 |
| 18 | balanced | 38.980718 | 0.450244 | 0.038279 |
| 36 | balanced | 124.327323 | 0.408983 | 0.033208 |
Extra
These are just some additional visualisations.
Provided you're comfortable using metrics instead of relying on a confusion_matrix, you can use more of your training data by specifying multiple metrics in the cross-validated search.
From here on, I will drop the separate training and validation sets and just use "recall" as our metric of interest.
Notes
`RandomUnderSampler` is part of the Imblearn package, which provides many techniques for working with imbalanced data. There is also a `resample` method in scikit-learn, but Imblearn is a bit smoother to work with.
Note
Data can be oversampled easily by randomly sampling from minority classes with replacement to duplicate original samples.
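A minimal sketch of that random oversampling, using only the standard library and assuming a binary problem (Imblearn's `RandomOverSampler` handles the general case):

```python
import random

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate minority samples (with replacement) until the minority
    class matches the majority class size. Assumes two classes."""
    rng = random.Random(seed)
    minority = [(x, label) for x, label in zip(X, y) if label == minority_label]
    n_needed = (len(y) - len(minority)) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_needed)]
    X_new = list(X) + [x for x, _ in extra]
    y_new = list(y) + [label for _, label in extra]
    return X_new, y_new

X = [[0.1], [0.2], [0.3], [0.9]]
y = ["No", "No", "No", "Yes"]
X_res, y_res = random_oversample(X, y, "Yes")
print(y_res.count("No"), y_res.count("Yes"))  # 3 3
```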
Notes
We can see if an RBF kernel improves things, although if you plan on running this yourself (overwrite=True), it is computationally expensive.
A number of undersampling methods use heuristics based on k-nearest neighbours (KNN) classification$^8$. KNN finds the samples that are most similar to a data point we want to classify, based on a given distance metric, with the assigned class label depending on a majority vote by the nearest neighbours$^9$ (we'll come back to this later). NearMiss uses this by selecting samples in the class to be under-sampled for which the average distance to the closest (or farthest) samples of the minority class is smallest$^{10}$.
Undersampling techniques also include data-cleaning rules, where the number of samples per class is not specified; instead, data is edited by, for example, removing samples dissimilar to their neighbourhood$^{11}$ or removing one or both samples from different classes when they are nearest neighbours of each other$^{12}$.
Instead of just randomly oversampling, there are also approaches that generate new samples through interpolation, such as SMOTE and ADASYN. However, these methods can generate noisy samples, so cleaning rules can be applied after oversampling$^{13}$.
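The core of SMOTE-style interpolation is generating a synthetic point on the line segment between a minority sample and one of its neighbours. A sketch of that single step (real SMOTE chooses among the k nearest minority neighbours; this just interpolates between two given samples):

```python
import random

def interpolate_sample(a, b, rng):
    """Synthetic point a + gap * (b - a), gap drawn uniformly from [0, 1)."""
    gap = rng.random()
    return [ai + gap * (bi - ai) for ai, bi in zip(a, b)]

rng = random.Random(0)
a, b = [1.0, 2.0], [2.0, 3.0]  # two nearby minority samples
synthetic = interpolate_sample(a, b, rng)
print(synthetic)  # lies on the segment between a and b
```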
There is no single best approach; it typically depends on the data and the aims of the model.
Below are examples of cross-validation scores for the best models (according to recall) for the different approaches.
Notes
Using the figure above, for the client who wants the model to prioritise avoiding incorrectly classifying an individual who will default, we would probably choose the undersampled linear SVM.
As we can see, on the test set we get scores similar to those on the validation set.
Notes
| | No | Yes | avg / total |
|---|---|---|---|
| precision | 0.99 | 0.25 | 0.96 |
| recall | 0.89 | 0.88 | 0.89 |
| specificity | 0.88 | 0.89 | 0.88 |
| f1-score | 0.94 | 0.39 | 0.91 |
| geo | 0.88 | 0.88 | 0.88 |
| iba | 0.78 | 0.78 | 0.78 |
| support | 958.00 | 42.00 | 1000.00 |
Would adding in if someone is a student improve the model?
Note
| | default | student | balance | income |
|---|---|---|---|---|
| 1 | No | No | 729.526495 | 44361.62507 |
| 2 | No | Yes | 817.180407 | 12106.13470 |
| 3 | No | No | 1073.549164 | 31767.13895 |
| 4 | No | No | 529.250605 | 35704.49394 |
| 5 | No | No | 785.655883 | 38463.49588 |
Note
We can put a `OneHotEncoder` into the pipeline for the categorical student column, while leaving the continuous data as-is. In this case it does not seem to improve the metric of most interest (recall); it improved precision at the expense of recall.
| model | f1 | precision | recall |
|---|---|---|---|
| Undersample Linear SVM | 0.318034 | 0.198239 | 0.879836 |
| Undersample New Linear SVM | 0.353487 | 0.261527 | 0.801110 |
| Undersample New RBF SVM | 0.306990 | 0.189444 | 0.824605 |
| Undersample RBF SVM | 0.307376 | 0.191444 | 0.793571 |
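The encoding step mentioned above turns the categorical student column into indicator columns while leaving the continuous balance/income columns untouched. A standard-library sketch of the output `OneHotEncoder` produces:

```python
def one_hot(values):
    """One-hot encode a categorical column: one indicator per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

student = ["No", "Yes", "No", "No", "No"]  # first five rows of the table above
print(one_hot(student))  # [[1, 0], [0, 1], [1, 0], [1, 0], [1, 0]]
```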
Fairness
Imagine we used the model in practice, and those deemed more likely to default were given more unfavourable terms because they seemed more risky according to the algorithm.
Imagine adding their student status had improved our model. Would it be fair to judge their chances of defaulting based on their student status rather than their actual financial information?
Maybe we need to know more about the people applying for the loan, but what variables do we use?
This is a difficult but important aspect of ML and I leave it here to make you think about the possibilities. We'll be exploring this more in the last week.
Notes
Some models (e.g. tree-based classifiers) are inherently multiclass, whereas other machine learning algorithms can be extended to multiclass classification using techniques such as the one-versus-rest or one-versus-one methods$^3$.
The one-versus-rest approach is where you train a classifier for each class and select the class whose classifier outputs the highest score$^3$.
In other words, if we fit $K$ SVMs, we assign a test observation $x^*$ to the class for which $\beta_{0k} + \beta_{1k}x^*_1 + \cdots + \beta_{pk}x^*_p$ is largest (the most confident)$^5$.
Advantage: As each class is fitted against all other classes for each classifier, it is relatively interpretable$^{14}$.
Disadvantages: Can result in ambiguous decision regions (e.g. a point could be class 1 or class 2), and the classifiers can suffer from issues of class imbalance$^4$.
Notes
Here we use `sklearn.svm.SVC`; you could also put the SVC inside `sklearn.multiclass.OneVsRestClassifier`.

| | param_svm_clf__class_weight | param_svm_clf__C | param_svm_clf__gamma | mean_test_score | std_test_score |
|---|---|---|---|---|---|
| 23 | balanced | 42.715424 | 0.001768 | 0.966667 | 0.018856 |
| 50 | balanced | 11.649174 | 0.002483 | 0.966667 | 0.018856 |
| 25 | None | 106.523159 | 0.001016 | 0.966667 | 0.018856 |
| 51 | balanced | 2.137057 | 0.017867 | 0.966667 | 0.018856 |
| 6 | balanced | 0.397242 | 0.062924 | 0.966667 | 0.018856 |
Another strategy is to use a OneVsOne approach.
This trains $N \times (N-1) / 2$ classifiers, one for each pair of classes.
When a prediction is made, the class that is selected the most is chosen (Majority Vote)$^3$.
Advantage: It is useful where algorithms do not scale well with data size (such as SVM), because each classifier only needs to be trained on, and predict with, a small subset of the data$^{3,14}$.
Disadvantages: Can still result in ambiguous decision regions and be computationally expensive$^4$.
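The $N \times (N-1)/2$ class pairs a one-vs-one scheme trains can be enumerated directly (the class names below are illustrative, not from the notes):

```python
from itertools import combinations

def one_vs_one_pairs(classes):
    """Every unordered pair of classes: N * (N - 1) / 2 classifiers."""
    return list(combinations(classes, 2))

pairs = one_vs_one_pairs(["A", "B", "C"])
print(len(pairs), pairs)  # 3 [('A', 'B'), ('A', 'C'), ('B', 'C')]
```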
Notes
| | param_svm_clf__class_weight | param_svm_clf__C | param_svm_clf__gamma | mean_test_score | std_test_score |
|---|---|---|---|---|---|
| 23 | balanced | 42.715424 | 0.001768 | 0.966667 | 0.018856 |
| 50 | balanced | 11.649174 | 0.002483 | 0.966667 | 0.018856 |
| 25 | None | 106.523159 | 0.001016 | 0.966667 | 0.018856 |
| 51 | balanced | 2.137057 | 0.017867 | 0.966667 | 0.018856 |
| 6 | balanced | 0.397242 | 0.062924 | 0.966667 | 0.018856 |
scikit-learn implements macro and micro averaging methods to extend scoring metrics to multiclass problems.
The micro-average is calculated from the individual TPs, TNs, FPs, and FNs of the system.
For example, the micro-average precision score for a $k$-class system is,
$$ PRE_{micro} = \frac{TP_1+\ldots+TP_K}{TP_1+\ldots+TP_K+FP_1+\ldots+FP_K}. $$

This is useful when we want to weight each instance or prediction equally.
The macro-average is the average scores of the different systems:
$$ PRE_{macro} = \frac{PRE_1+\ldots+PRE_K}{K}. $$

This is useful when we want to evaluate the overall performance of a classifier with regard to the most frequent class labels.
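A sketch of both averages from per-class (TP, FP) counts, using toy numbers (not from the notes): with one frequent and one rare class, the micro-average is dominated by the frequent class, while the macro-average weights both classes equally.

```python
def micro_macro_precision(per_class):
    """per_class: list of (TP, FP) pairs, one entry per class."""
    micro = sum(tp for tp, _ in per_class) / sum(tp + fp for tp, fp in per_class)
    macro = sum(tp / (tp + fp) for tp, fp in per_class) / len(per_class)
    return micro, macro

per_class = [(90, 10), (1, 1)]  # frequent class PRE=0.9, rare class PRE=0.5
micro, macro = micro_macro_precision(per_class)
print(f"micro = {micro:.3f}, macro = {macro:.3f}")  # micro = 0.892, macro = 0.700
```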
There are always advantages and disadvantages to using any model on a particular dataset.